Similarity Measures Based on Latent Dirichlet Allocation
Abstract
We present the results of our investigation of semantic similarity measures at the word and sentence level, based on two fully automated approaches to deriving meaning from large corpora: Latent Dirichlet Allocation, a probabilistic approach, and Latent Semantic Analysis, an algebraic approach. The focus is on similarity measures based on Latent Dirichlet Allocation, owing to their novelty, while the Latent Semantic Analysis measures are used for comparison. We explore two types of measures based on Latent Dirichlet Allocation: measures based on distances between probability distributions, which can be applied directly to larger texts such as sentences, and a word-to-word similarity measure that is then extended to work at the sentence level. We report results on paraphrase identification using the Microsoft Research Paraphrase corpus.
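The abstract does not say which distances between probability distributions are used, so the following is only a minimal sketch of the general idea, assuming each sentence has already been mapped to a distribution over LDA topics. The choice of the Hellinger distance and the Jensen-Shannon divergence, the function names, and the three-topic example vectors are all illustrative assumptions, not the authors' method or data.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range 0..1)."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2, range 0..1), a symmetric KL variant."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def similarity(p, q, distance=hellinger):
    """Turn a distance bounded in [0, 1] into a similarity score."""
    return 1.0 - distance(p, q)

# Hypothetical topic distributions for three sentences (assumed values).
sentence_a = [0.7, 0.2, 0.1]
sentence_b = [0.6, 0.3, 0.1]
sentence_c = [0.1, 0.1, 0.8]

if __name__ == "__main__":
    for name, dist in (("Hellinger", hellinger), ("Jensen-Shannon", jensen_shannon)):
        print(name, "a~b:", similarity(sentence_a, sentence_b, dist))
        print(name, "a~c:", similarity(sentence_a, sentence_c, dist))
```

Under these assumptions, two sentences would be labeled paraphrases when their similarity exceeds a threshold tuned on training data. Both distances are bounded, which makes the conversion to a similarity score straightforward.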
Similar resources
Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation
Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...
Obtaining Single Document Summaries Using Latent Dirichlet Allocation
In this paper, we present a novel approach that makes use of topic models based on Latent Dirichlet Allocation (LDA) for generating single document summaries. Our approach is distinguished from other LDA-based approaches in that we identify the summary topics which best describe a given document and only extract sentences from those paragraphs within the document which are highly correlated give...
A Study of Probabilistic and Algebraic Methods for Semantic Similarity
We study and propose in this article several novel solutions to the task of semantic similarity between two short texts. The proposed solutions are based on the probabilistic method of Latent Dirichlet Allocation (LDA) and on the algebraic method of Latent Semantic Analysis (LSA). Both methods, LDA and LSA, are completely automated methods used to discover latent topics or concepts from large c...
Identifying Word Translations from Comparable Corpora Using Latent Topic Models
A topic model outputs a set of multinomial distributions over words for each topic. In this paper, we investigate the value of bilingual topic models, i.e., a bilingual Latent Dirichlet Allocation model for finding translations of terms in comparable corpora without using any linguistic resources. Experiments on a document-aligned English-Italian Wikipedia corpus confirm that the developed meth...
Legal Documents Clustering using Latent Dirichlet Allocation
At present, the availability of a large number of legal judgments in digital form creates opportunities and challenges for both the legal community and information technology researchers. This development calls for assistance in organizing, analyzing, retrieving and presenting this content in a helpful and distributed manner. We propose an approach to cluster legal judgments based on th...
Publication date: 2013